ABSTRACT

Recently, several person-fit statistics have been proposed
to detect nonfitting response patterns. This study is designed to generalize
an approach followed by Klauer (1995) to an adaptive testing system using the
two-parameter logistic model (2PL) as a null model. The approach developed by
Klauer is described, and some difficulties in generalizing it to a
computerized adaptive testing model are explored. Alternative approaches are
presented, and the results of a power study and the consequences for
person-fit measurement in adaptive testing situations are discussed. The
first part of the simulation concerns the ability to ignore the process that
causes missing data and the impact of the fact that data in adaptive testing
are not observed at random. Then the power of the three proposed tests is
studied. Results suggest that tests against violations of local independence
in the 2PL have little power, and that the power for testing against
noninvariant abilities is relatively larger but still low. (Contains 4 tables
and 18 references.) (SLD)

Statistical Tests for Person Misfit in Computerized Adaptive Testing



Summary 

If both the null and the alternative model are exponential family models, a uniformly most
powerful unbiased test can be based on the minimal sufficient statistics of the
alternative model (Lehmann, 1986). Strictly speaking, this approach is not valid in the
framework of the two-parameter logistic model (2PL), since this model does not define an
exponential family. However, the notion of using statistics related to the parameters of an
alternative model as the basis of a test is intuitively appealing, even though a uniformly most
powerful test probably does not exist. Below, $\hat{\theta}$ will be the weighted ML estimate (Warm,
1989). In an adaptive testing environment, the following alternatives to the 2PL will be
considered.



1 Non-invariant abilities: in the alternative model it is assumed that the 2PL is valid during
the whole testing session, but that the respondent's ability parameter changes during
test-taking. More specifically, it will be assumed that a person has two person parameters, say
$\theta_1$ and $\theta_2$; the first parameter governs the responses on the first half of the test, the second
parameter governs the second part. The test will be based on the statistic $T = (T_1, T_2)$,
where $T_1$ and $T_2$ are the unweighted sum scores on the first and second part, respectively.
Of course, splitting the test into two halves is arbitrary; the practitioner can use any
nontrivial, suitable partition of the test.



2 Lack of local stochastic independence: it is assumed that the probability of a correct
response on an item is augmented by a previous correct response. This is modeled by
introducing a transfer parameter $\delta$, such that the probability of two consecutive responses
$x_i$ and $x_{i+1}$ is given by

$$p(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta, \alpha, \beta, \delta) \propto \exp[x_i(\alpha_i\theta - \beta_i) + x_{i+1}(\alpha_{i+1}\theta - \beta_{i+1}) + x_i x_{i+1}\delta].$$

The test will be based on the statistic $L = \sum_i X_i X_{i+1}$.

3 Person-specific item parameters. The test will be based on classifying the probability of the
observed response pattern $x = (x_1, \ldots, x_i, \ldots, x_k)$, $p(X = x \mid \hat{\theta}, \alpha, \beta)$, among all possible values
$p(Y = y \mid \hat{\theta}, \alpha, \beta)$, where $y$ is a possible response pattern. The statistic will be denoted by $X$.

Significance probabilities for the tests based on $T$, $L$ and $X$ are calculated by simulating their
respective distributions under the null hypothesis. Simulation studies were carried out to
assess the power of the proposed tests. One important point here is that the item 
administration design is contingent on the responses given. It is well known (see, for instance, 
Mislevy, 1986, and Glas, 1988) that the stochastic nature of the design does not threaten the 
consistency of the estimates. This is due to the fact that in adaptive testing unobserved 
responses are missing at random (MAR, Rubin, 1976). Therefore, for the estimation of model 
parameters the process that causes missing data can be ignored. However, in adaptive testing 
the data are not observed at random (OAR, Rubin, 1976). As a result, the set of possible 
response patterns given the design is only a subset of the possible response patterns in a fixed 
design. The impact of these restrictions was one of the focuses of the simulation studies 
performed. 



Results 

1 For proper simulation of the distributions of $T$, $L$ and $X$ the stochastic nature of the design
has to be taken into account.

2 The accuracy with which the distributions of $T$, $L$ and $X$ can be simulated seems to depend
on the model considered (RM versus 2PL) and the distribution of item parameters in the
item bank.

3 The power of the tests under the alternative model is very low. Only the power of the test
based on $T$ under the non-invariant abilities model reaches a non-trivial value.

4 The error in the estimation of $\theta$ is not much increased by the model violations introduced,
so the estimates are quite robust.

It must be stressed that the second and third results are preliminary; in particular, the simulation
algorithm for the distribution of $T$ is not yet completely satisfactory.

Introduction

Person-fit analysis (or appropriateness measurement) is concerned with the detection of
response patterns that are unusual (nonfitting) given what is expected under an item response
theory (IRT) model. When a person's response pattern is nonfitting according to the model, it
is questionable whether the test score is an adequate description of the trait being measured.
Typical forms of aberrant response behavior are guessing and cheating (e.g., as a result of item
or test preview), which may result in spuriously high or spuriously low test scores.

Recently, several person-fit statistics have been proposed to detect nonfitting response
patterns (e.g., Drasgow, Levine, & Williams, 1985; Molenaar & Hoijtink, 1990; Meijer,
1994). A popular approach is to determine the (log)likelihood of a response pattern and to
classify response patterns with an extreme likelihood statistic as nonfitting. Note that with
this approach it is only tested whether a response pattern is unlikely given the model: the
type of nonfitting response behavior is not identified. In the context of the Rasch (1960)
model, Klauer (1995) addressed this problem by specifying alternative models for response
behavior and testing the null hypothesis of model-conforming behavior against the specified
alternative model.

Klauer (1995) considered three types of nonfitting response behavior: violations of local 
stochastic independence between the items, violations of invariant ability across subtests of 
the total test, and person-specific item discrimination. For these three model violations, person 
tests were constructed and a power analysis was performed. A limitation of the Klauer (1995) 
study is that it was restricted to fixed-format tests and to the Rasch model. The present study 
was designed to generalize the approach followed by Klauer (1995) to an adaptive testing
setting using the two-parameter logistic model (2PL) as a null model.

This report is organized as follows. First, the approach by Klauer is presented in Section 2.
Second, some problems with respect to the generalization of his approach to an adaptive
testing situation are discussed. An alternative approach is presented in Section 3. Finally, the
results of a power study and the consequences for person-fit measurement in adaptive testing
situations are discussed in Section 4.

The Rasch Model and Uniformly Most Powerful Tests

To test normal response behavior against a specified alternative model, Klauer (1995)
constructed uniformly most powerful (UMP) tests using the Rasch model. Because UMP tests are
central to this approach, the principle of these tests will be characterized first.

The Rasch model is based on the assumptions of local stochastic independence,
unidimensionality of the latent trait $\theta$, sufficiency of a respondent's unweighted sum score, and
some technical assumptions beyond the scope of the present paper. Let $\beta_i$ denote the
difficulty of item $i$, where $i = 1, \ldots, k$. Then the probability of a correct response on item $i$ given $\theta$,
according to the Rasch model (RM), can be written as

$$P_i(\theta) = \frac{\exp\{\theta - \beta_i\}}{1 + \exp\{\theta - \beta_i\}}. \tag{1}$$



It is well-known that the RM belongs to the exponential family of distributions. 
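
The role of the unweighted sum score as a sufficient statistic can be illustrated numerically. The following Python sketch is not part of the original study: the item difficulties and the two response patterns are arbitrary illustrative values, and the function names are hypothetical. It evaluates the Rasch likelihood built from Equation 1 for two patterns with the same raw score and shows that their likelihood ratio does not change with $\theta$, which is what sufficiency of the raw score implies.

```python
import math

def rasch_prob(theta, beta):
    """Probability of a correct response under the Rasch model (Equation 1)."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))

def pattern_likelihood(theta, betas, pattern):
    """Likelihood of a 0/1 response pattern, assuming local independence."""
    lik = 1.0
    for beta, x in zip(betas, pattern):
        p = rasch_prob(theta, beta)
        lik *= p if x == 1 else 1.0 - p
    return lik

# Hypothetical item difficulties and two patterns with the same raw score (2).
betas = [-1.0, 0.0, 0.5, 1.5]
pattern_a = [1, 1, 0, 0]
pattern_b = [0, 0, 1, 1]

def likelihood_ratio(theta):
    """Ratio of the two pattern likelihoods at a given ability."""
    num = pattern_likelihood(theta, betas, pattern_a)
    den = pattern_likelihood(theta, betas, pattern_b)
    return num / den

# The ratio is the same at every ability level: it depends on the item
# difficulties only, here exp(-(-1.0 + 0.0) + (0.5 + 1.5)) = exp(3).
ratios = [likelihood_ratio(theta) for theta in (-2.0, 0.0, 2.0)]
print(ratios)
```

Repeating the calculation with item-specific discriminations would make the ratio depend on $\theta$, which is the difficulty with the 2PL discussed later in the report.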

The general derivation of a UMP test in an exponential family model proceeds as follows. Let
$x = (x_1, \ldots, x_k)$ be a realization of $X = (X_1, \ldots, X_k)$ and let $\xi$ and $\eta$ be two parameters. The
likelihood of the two-parameter exponential family can then be written as

$$P(X = x \mid \xi, \eta) = \mu(\xi, \eta)\, h(x) \exp\{\eta T(x) + \xi R(x)\}, \tag{2}$$

where $\mu(\xi, \eta)$ and $h(x)$ are normalizing functions and $T(x)$ and $R(x)$ are sufficient statistics.
Considering the factorization criterion (e.g., Lindgren, 1993, p. 231), the statistics $T(x)$ and $R(x)$
are minimal sufficient statistics for the parameters $\eta$ and $\xi$, respectively.

A uniformly most powerful (UMP) test $\phi$ for testing $H_0\colon \eta = \eta_0$ against $H_1\colon \eta \neq \eta_0$ for some
parameter $\eta$ in an exponential family of distributions can be derived as follows (Lehmann, 1959).
Let the vector $X$ be distributed according to the exponential family of distributions, i.e.,
Equation 2 is generalized to

$$P(X = x \mid \theta, \eta) = \mu(\theta, \eta) \exp\Big\{\eta T(x) + \sum_i \theta_i R_i(x)\Big\}. \tag{3}$$

The UMP test for testing the hypothesis $H_0\colon \eta = \eta_0$ vs. $H_1\colon \eta \neq \eta_0$ with size $\alpha$ is given by $\phi$
satisfying

$$\phi(x) = \begin{cases} 0 & \text{when } c_1(r) < T(x) < c_2(r), \\ \gamma_i(r) & \text{when } T(x) = c_i(r) \text{ for } i = 1, 2, \\ 1 & \text{otherwise}, \end{cases} \tag{4}$$

where $R(x) = r$ is given and the functions $c_i(r)$ and $\gamma_i(r)$ are determined by solving the
following equations:

$$E[\phi(X) \mid r, \eta = \eta_0] = \alpha,$$

$$E[T(X)\phi(X) \mid r, \eta = \eta_0] = \alpha\, E[T(X) \mid r, \eta = \eta_0].$$

So, $H_0$ is accepted when the test statistic $T$ takes values between $c_1$ and $c_2$, and randomization
is applied in the case of $T = c_i$, for $i = 1, 2$.

This approach can be applied within the framework of the RM as follows. Let $X_i$ be the
score on a dichotomously scored item $i$, where 1 denotes a correct or keyed response and 0
otherwise. Further, $x = (x_1, \ldots, x_k)$ is a realization of a vector $X = (X_1, \ldots, X_k)$ of responses to $k$
items, $\theta$ is a person parameter, and $\beta_i$ are item parameters. Then, the Rasch model as defined
in Equation 1 can be written as

$$P(X = x \mid \theta) = \underbrace{\Big\{\prod_{i=1}^{k} [1 + \exp\{\theta - \beta_i\}]\Big\}^{-1}}_{\mu(\theta)} \times \underbrace{\exp\Big\{-\sum_{i=1}^{k} \beta_i x_i\Big\}}_{h(x)} \times \exp\Big\{\theta \sum_{i=1}^{k} x_i\Big\} = \mu(\theta)\, h(x) \exp\{\theta R(x)\},$$

where $R(x) = \sum_i x_i$ is the raw score of the response vector $x$; $R$ is a minimal sufficient
statistic for the ability parameter $\theta$.

Klauer (1995) constructed uniformly most powerful person-fit tests for three specific
alternative models in a fixed-format setting; these models are generalizations of the RM testing the
assumptions of local stochastic independence, unidimensional abilities across subtests of the
total test, and the invariance of item discriminations across the items of the test. These
alternative models are members of the two-dimensional exponential family. Below, the three
alternative models will be introduced.

Non-Invariant Abilities across Subtests of the Total Test

An assumption of the RM is the invariance of the ability parameter $\theta$. When the test is split
into two subtests $A_1$ and $A_2$, the abilities on the two subtests will be the same, that is, $\theta_1 = \theta_2$.
The alternative to the model with invariant abilities contains a multidimensional
ability parameter. Consider the simplest (two-dimensional) case, where the test is divided into
two subtests $A_1$ and $A_2$ and the ability parameter can be written as $\theta = (\theta_1, \theta_2)$; thus the
examinee has ability $\theta_1$ on subtest $A_1$ and ability $\theta_2$ on subtest $A_2$. Let $R_j(x)$ be the raw score obtained
from the subtest $A_j$ for $j = 1, 2$, $R(x)$ the raw score of the total test, and $\mu(\theta, \eta)$ and $h(x)$ suitable
functions. Then, after some algebra and for known item parameters, the model with a
multidimensional ability parameter can be written as

$$P(X = x \mid \theta, \eta) = \mu(\theta_2, \theta_1 - \theta_2)\, h(x) \exp\{(\theta_1 - \theta_2) R_1(x) + \theta_2 R(x)\}. \tag{7}$$

$R_1(x)$ and $R(x)$ are sufficient statistics for $\eta = \theta_1 - \theta_2$ and $\theta = \theta_2$, respectively. When $\eta = 0$, that is,
$\theta_1 = \theta_2$, the model becomes the RM. Positive values of $\eta$ indicate that $\theta_1 > \theta_2$; this can be the
case when an examinee has pre-knowledge about one subtest and therefore the score on that
subtest is much higher than the score on the rest of the test. Thus, the parameter $\eta$ describes
the size and direction of the differences of the ability parameters. Nonfitting response
behavior like guessing, carelessness, sleeping and fumbling can also be the cause of non-invariant
abilities.

Local Stochastic Dependence

Another assumption made for the RM is the assumption of local stochastic independence. The
following model of Jannarone (1986) represents a violation of local stochastic independence:

$$P(X = x \mid \theta, \eta) = \mu(\theta, \eta)\, h(x) \exp\Big\{\eta \sum_{i=1}^{k-1} x_i x_{i+1} + \theta R(x)\Big\}, \tag{8}$$

where again $\mu(\theta, \eta)$ and $h(x)$ are normalizing functions. Therefore $L(X) = \sum_{i=1}^{k-1} X_i X_{i+1}$ is a
sufficient statistic for parameter $\eta$. The RM is obtained for $\eta = 0$; positive values of $\eta$ occur when,
for example, an item provides extra information that is useful for answering the next item.
Thus parameter $\eta$ describes the size and direction of the violation of the local stochastic
independence assumption.

Person-Specific Item Discrimination

The third assumption of the RM is that the item discriminations are invariant across the items
of the test. When $\eta$ is used as a parameter indicating the overall level of the item
discrimination, an alternative model is

$$P(X = x \mid \theta, \eta) = \underbrace{\Big\{\prod_{i=1}^{k} [1 + \exp\{\eta(\theta - \beta_i)\}]\Big\}^{-1}}_{\mu(\theta, \eta)} \exp\Big\{\sum_{i=1}^{k} x_i\, \eta(\theta - \beta_i)\Big\}, \tag{9}$$

where again the item parameters are considered known.

For $\eta = 1$ the RM is obtained; when $0 < \eta < 1$ the overall level of item discrimination is
less than the overall discrimination level in the RM, and when $\eta > 1$ the overall level of item
discrimination is higher than the overall level of discrimination in the RM. Thus, the
parameter $\eta$ describes the size and direction of the violation of invariant item discriminations.
Rewriting Equation 9 in the form of Equation 2 results in

$$P(X = x \mid \theta, \eta) = \mu(\theta, \eta) \exp\Big\{-\eta \sum_{i=1}^{k} \beta_i x_i + \theta R(x)\Big\}, \tag{10}$$

where $\eta$ is absorbed in $\theta$, i.e., $\theta := \eta\theta$. Therefore, the statistics $M(X) = -\sum_i \beta_i X_i$ and $R(X)$ are
sufficient statistics for the person-specific parameters $\eta$ and $\theta$.

Limitations for the Two-Parameter Logistic Model and Adaptive Testing

The two-parameter logistic model (2PL) is a more general model than the RM. Let $\alpha_i$ be the
item discrimination of item $i$. Then the probability of a correct response to item $i$ according to
the 2PL is given by

$$P(X_i = 1 \mid \theta) = \frac{\exp\{\alpha_i(\theta - \beta_i)\}}{1 + \exp\{\alpha_i(\theta - \beta_i)\}}, \tag{11}$$

and the joint distribution of the observed response pattern $x$ can be written as

$$P(X = x \mid \theta) = \Big[\prod_i \big(1 + \exp\{\alpha_i(\theta - \beta_i)\}\big)\Big]^{-1} \exp\Big\{-\sum_i \alpha_i \beta_i x_i\Big\} \exp\Big\{\theta \sum_i \alpha_i x_i\Big\} = \mu(\theta)\, h(x) \exp\{\theta W(x)\}, \tag{12}$$

where $W(x)$ is the weighted score of $x$ with the $\alpha_i$ as weights, and the $\alpha_i$ are unknown
item parameters to be estimated. In the 2PL, the total weighted score $W$ is not a sufficient
statistic for the ability parameter $\theta$, because $W$ depends on the $\alpha_i$. Further, the conditional
distribution of the sample, given $W$, depends on the unknown item parameters $\alpha_i$. For known item
parameters $\alpha_i$ and $\beta_i$, $W$ is a minimal sufficient statistic, although Andersen (1977) has shown
that conditioning on $W$ leads to a non-trivial likelihood only for certain values of the $\alpha_i$. In that
case, $W$ can be used for the construction of UMP tests.

In an adaptive testing situation it is difficult to construct sufficient statistics. Adaptive
testing is the limiting case of multistage testing, where each test consists of only one item. Glas
(1988) showed for the RM in a two-stage testing design that the conditional probability of the
sample $X$ given the raw score $R$ is not independent of $\theta$, and therefore $R$ is not a sufficient
statistic for parameter $\theta$.

Because of the absence of a minimal sufficient statistic, it is not certain whether a UMP test
for the RM exists in an adaptive testing situation. Even though a uniformly most powerful test
probably does not exist, the notion of using statistics related to the parameters of an
alternative model as the basis of a test is intuitively appealing. This point will be taken up again below.

Let the design, $I$, be the observed sequence of items an examinee responded to (a vector of item
numbers). In an adaptive test, item selection is based on the responses on previous items.
Therefore, the observed design $I$ is dependent on the observed vector of responses $X$. The
missing data are the responses on all the non-selected items. Ignorability of the process that
causes missing data concerns the consistency of the estimates of the model parameters. In
adaptive testing unobserved responses are missing at random (MAR; Rubin, 1976; Mislevy,
1986); therefore, for the estimation of model parameters the process that causes missing data
can be ignored. However, the data are not observed at random (OAR; Rubin, 1976). This
means that the set of possible response patterns given the design, $\{X \mid I\}$, is only a subset of the
possible response patterns in a fixed design. In a fixed design the probability of the observed
sequence of items, $g(I)$, equals one, because every examinee responded to the same items. As
a result $f(X, I) = f(X \mid I)\, g(I) = f(X \mid I)$, where $f(X, I)$ is the joint probability of $X$ and $I$, and
$f(X \mid I)$ the conditional probability of $X$ given $I$. In adaptive testing $g(I) < 1$, because the items
administered are different for every examinee; therefore, $f(X, I) = f(X \mid I)\, g(I) \neq f(X \mid I)$.



Method 



This simulation study consists of two parts. The first part concerns the ignorability of the
process that causes missing data and the impact of the fact that the data in adaptive testing are
not OAR. In the second part, the power of the proposed tests was investigated.



Model-Fitting Response Vectors 

For each simulee, responses to dichotomous items were generated in the following way. The
procedure starts with randomly drawing a true ability $\theta$ from the standard normal distribution.
$P_i(\theta)$ was computed according to the 2PL given in Equation 11. Then, a random number $y$
was drawn from $U(0,1)$, and when $y < P_i(\theta)$ the response was set to 1, and to 0 otherwise. The first
three items of the test were selected with item parameter $\beta$ around 0, and the ability parameter
was estimated using weighted maximum likelihood estimation (Warm, 1989) based on the
responses to the first three items. The next item selected was the item with maximum
information given the estimated ability $\hat{\theta}$. Again, a response was generated, the ability was
estimated on the basis of previous item responses, and the item with maximum information was
selected. This procedure was repeated until the test consisted of 20 items.
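
The adaptive procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the item bank is a hypothetical random bank of $(\alpha_i, \beta_i)$ pairs rather than the empirical bank used in the study, the weighted likelihood estimate is found by a simple grid search that maximizes the likelihood times the square root of the test information (the form Warm's estimator takes for the 2PL), and all function names are invented for the example.

```python
import math
import random

def p_2pl(theta, alpha, beta):
    """2PL probability of a correct response (Equation 11)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def item_information(theta, alpha, beta):
    """Fisher information of a 2PL item at theta."""
    p = p_2pl(theta, alpha, beta)
    return alpha ** 2 * p * (1.0 - p)

def wle(items, responses, grid):
    """Grid-search weighted likelihood estimate (an assumed, simple solver)."""
    best_theta, best_value = grid[0], -math.inf
    for theta in grid:
        log_lik, info = 0.0, 0.0
        for (alpha, beta), x in zip(items, responses):
            p = p_2pl(theta, alpha, beta)
            log_lik += math.log(p) if x == 1 else math.log(1.0 - p)
            info += item_information(theta, alpha, beta)
        value = log_lik + 0.5 * math.log(info)  # log-likelihood plus log weight
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta

def simulate_cat(bank, true_theta, test_length=20, n_start=3, rng=random):
    """Simulate one adaptive test: item indices, responses, final estimate."""
    grid = [g / 20.0 for g in range(-80, 81)]  # theta grid from -4 to 4
    # Start with the items whose difficulty is closest to 0.
    order = sorted(range(len(bank)), key=lambda i: abs(bank[i][1]))
    administered = order[:n_start]
    responses = [int(rng.random() < p_2pl(true_theta, *bank[i])) for i in administered]
    theta_hat = wle([bank[i] for i in administered], responses, grid)
    while len(administered) < test_length:
        remaining = [i for i in range(len(bank)) if i not in administered]
        # Select the unused item with maximum information at the current estimate.
        nxt = max(remaining, key=lambda i: item_information(theta_hat, *bank[i]))
        administered.append(nxt)
        responses.append(int(rng.random() < p_2pl(true_theta, *bank[nxt])))
        theta_hat = wle([bank[i] for i in administered], responses, grid)
    return administered, responses, theta_hat

rng = random.Random(1)
# Hypothetical item bank of (alpha, beta) pairs.
bank = [(rng.uniform(0.5, 2.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
items, responses, theta_hat = simulate_cat(bank, true_theta=rng.gauss(0.0, 1.0), rng=rng)
```

Because the administered items depend on the responses, repeated calls with the same true ability generally produce different item sequences, which is the stochastic design discussed in the text.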

Types of Nonfitting Response Vectors

Two different types of nonfitting response vectors were simulated. The first type of nonfitting
vectors was generated with a two-dimensional ability parameter. Two values $\theta_1$ and $\theta_2$ were
drawn from the bivariate standard normal distribution. The correlations $\rho = 0.6$ and $\rho = 0.8$ were
used. During the first half of the test $P_i(\theta_1)$ was used and during the second half $P_i(\theta_2)$ was
used to generate the responses.

The second type of nonfitting patterns was generated with violations of local
stochastic independence. Item scores were generated according to a generalized version of the
model proposed by Jannarone (1986). Let $\delta_{i,i+1}$ be a parameter modeling association between
items $i$ and $i+1$; that is, the model can be written as

$$P(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta) \propto \exp[x_i(\alpha_i\theta - \beta_i) + x_{i+1}(\alpha_{i+1}\theta - \beta_{i+1}) + x_i x_{i+1}\delta_{i,i+1}]. \tag{13}$$

The fact that $\delta_{i,i+1}$ models association is verified as follows. From the model in Equation 13,
the four possible realizations of $(X_i, X_{i+1})$ have the following probabilities:

$$P(X_i = 0, X_{i+1} = 0 \mid \theta) \propto 1,$$

$$P(X_i = 1, X_{i+1} = 0 \mid \theta) \propto \exp[\alpha_i\theta - \beta_i],$$

$$P(X_i = 0, X_{i+1} = 1 \mid \theta) \propto \exp[\alpha_{i+1}\theta - \beta_{i+1}], \text{ and}$$

$$P(X_i = 1, X_{i+1} = 1 \mid \theta) \propto \exp[(\alpha_i\theta - \beta_i) + (\alpha_{i+1}\theta - \beta_{i+1}) + \delta_{i,i+1}].$$

Hence the log odds ratio of the two responses equals $\delta_{i,i+1}$, which is zero under local
independence.

Above it was stated that a test could be based on statistics related to the parameters of an
alternative model. One could, for instance, construct a Lagrange Multiplier test (Aitchison and
Silvey, 1958), which is based on the derivative of the logarithm of the right-hand side of
Equation 13 with respect to $\delta_{i,i+1}$. The test boils down to evaluating the difference between $L$
and its expected value relative to its variance, all under the null model $\delta_{i,i+1} = 0$. Since
computing the variance is quite complicated here, this approach will not be pursued, and we will
proceed as follows. The conditional probability of the response to item $i+1$, given the response to item $i$, can
be written as

$$P(X_{i+1} = x_{i+1} \mid X_i = x_i, \theta) = \frac{P(X_i = x_i, X_{i+1} = x_{i+1} \mid \theta)}{P(X_i = x_i, X_{i+1} = 1 \mid \theta) + P(X_i = x_i, X_{i+1} = 0 \mid \theta)}. \tag{14}$$

Given the response to item $i$, the probability of a correct response to item $i+1$,
$P(X_{i+1} = 1 \mid x_i, \theta)$, was computed. If $P(X_{i+1} = 1 \mid x_i, \theta) > y$, where $y \sim U(0,1)$, the response to item
$i+1$ was set to 1, and to 0 otherwise. For generating response vectors, the values $\delta = 0.2$ and $\delta = 0.4$
were used.
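
The generation of locally dependent responses from Equations 13 and 14 can be sketched as follows. The Python code is an illustrative assumption rather than the authors' implementation: it uses a fixed list of hypothetical $(\alpha_i, \beta_i)$ pairs instead of adaptive item selection, a single $\delta$ for all item pairs, and it treats the first item as having no predecessor (no transfer term). The function names are invented for the example.

```python
import math
import random

def conditional_p_correct(theta, item_next, x_prev, delta):
    """P(X_{i+1} = 1 | X_i = x_prev, theta) from Equations 13 and 14.

    Items are (alpha, beta) pairs with kernel alpha * theta - beta. The factor
    that depends only on x_prev cancels in the ratio of Equation 14.
    """
    alpha_next, beta_next = item_next
    kernel_one = math.exp(alpha_next * theta - beta_next + x_prev * delta)
    return kernel_one / (kernel_one + 1.0)

def generate_dependent_responses(theta, items, delta, rng):
    """Generate a response vector in which each response depends on the previous one."""
    responses = []
    x_prev = 0  # assumption: the first item has no predecessor
    for item in items:
        p = conditional_p_correct(theta, item, x_prev, delta)
        x_prev = int(p > rng.random())  # response is 1 when p exceeds y ~ U(0, 1)
        responses.append(x_prev)
    return responses

rng = random.Random(7)
items = [(1.0, 0.0)] * 20  # hypothetical items, all with alpha = 1 and beta = 0
vector = generate_dependent_responses(0.0, items, delta=0.4, rng=rng)

# A previous correct response raises the probability of a correct response.
p_after_wrong = conditional_p_correct(0.0, (1.0, 0.0), 0, 0.4)
p_after_right = conditional_p_correct(0.0, (1.0, 0.0), 1, 0.4)
```

With $\delta = 0$ the conditional probability no longer depends on the previous response, so the sketch reduces to independent item responses.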



Generating distributions under $H_0$

For each response vector the realizations of $T$, $L$ and $X$ were computed. To generate the
distribution of the proposed statistics under the null model for each simulee, $m$ response vectors
(replications) were generated according to the null model (RM or 2PL). By generating the
distribution of the statistics under the null model, it was possible to determine whether the
observed response vector was classified as fitting or nonfitting.

The $m$ replications were generated using two different designs: a fixed design and a
stochastic adaptive design. First, the replications were generated according to a fixed design.
That is, for each replication the same items and the same test length were used. Let $I$ be the
observed sequence of items the simulee responded to. In the fixed design approach the
conditional distribution of the observed response pattern $X$ given the observed sequence of items $I$,
$f(X \mid I)$, was generated. Second, in the stochastic design approach the replications were
generated according to a stochastic adaptive design: given $\hat{\theta}$, $m$ response patterns were generated
according to the adaptive procedure described above. The weighted conditional distribution
$f(X \mid I)\, g(I)$, with $g(I)$ the probability of $I$, was taken into account.

For determining the distribution of $T$, the probability of each possible combination of
$(T_1, T_2)$ was determined as

$$P(t_1, t_2) = P(T_1 = t_1, T_2 = t_2) = \frac{1}{m}\,[\text{number of replications with } (t_1, t_2)].$$

The sums of the probabilities of all replications with $P(X_1, X_2) < P(t_1, t_2)$ and
$P(X_1, X_2) > P(t_1, t_2)$ were determined, where

$$S_0 = \sum P(X_1, X_2) \text{ for all replications with } P(X_1, X_2) < P(t_1, t_2),$$

$$S_1 = \sum P(X_1, X_2) \text{ for all replications with } P(X_1, X_2) > P(t_1, t_2), \text{ and}$$

$$S_2 = 1 - S_0 - S_1.$$

Then the probability $p^* = S_1 + uS_2$ was determined, with $u \sim U(0,1)$. The observed response
vector was classified as nonfitting the model when $p^* > 1 - \alpha$ or when $p^* < \alpha$.
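
The randomized classification based on $T$ can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the replicated score pairs are taken as given, each distinct pair of subtest scores enters the sums $S_0$ and $S_1$ once (so that $S_0 + S_1 + S_2 = 1$), an observed pair that never occurs in the replications is given probability 0, and the function name is hypothetical.

```python
import random
from collections import Counter

def classify_t(observed, replicated, alpha=0.05, rng=random):
    """Randomized two-sided classification of an observed (t1, t2) pair.

    `replicated` is a list of (T1, T2) pairs simulated under the null model.
    Returns the significance probability p* and the nonfit flag.
    """
    m = len(replicated)
    # Relative frequency of every distinct score pair among the replications.
    prob = {pair: count / m for pair, count in Counter(replicated).items()}
    p_obs = prob.get(observed, 0.0)
    s0 = sum(p for p in prob.values() if p < p_obs)  # less likely pairs
    s1 = sum(p for p in prob.values() if p > p_obs)  # more likely pairs
    s2 = 1.0 - s0 - s1                               # pairs equally likely
    p_star = s1 + rng.random() * s2
    flagged = p_star > 1.0 - alpha or p_star < alpha
    return p_star, flagged
```

A call such as `classify_t((9, 1), replicated_pairs, alpha=0.05)` returns the randomized significance probability together with the classification; `replicated_pairs` would come from the fixed or stochastic design replications described above.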

For extreme values of $L$ and $X$, that is, $L > c_L$ and $X > c_X$, with $c_L$ and $c_X$ the
$(1-\alpha)100\%$ percentiles of the $m$ generated values of $L$ and $X$, respectively, the observed response vector $x$
can be classified as nonfitting the model.

The power of the test statistics $T$, $L$ and $X$ was defined as the percentage of response
patterns classified as nonfitting. The mean absolute bias was defined as the mean of the
absolute distances between $\hat{\theta}$ and $\theta$ over the $n$ simulees:
$\mathrm{MAB}(\hat{\theta}) = \frac{1}{n}\sum_{j=1}^{n} |\hat{\theta}_j - \theta_j|$. The MAB was
taken into account to investigate the differences between the true ability $\theta$ and the final estimated
ability $\hat{\theta}$.
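
The two summary measures defined above are simple to compute. The following Python sketch only restates the definitions; the function names and the example values are hypothetical and are not results from the study.

```python
def power_percentage(flags):
    """Percentage of response vectors classified as nonfitting."""
    return 100.0 * sum(1 for f in flags if f) / len(flags)

def mean_absolute_bias(theta_hat, theta_true):
    """Mean of the absolute distances between estimated and true abilities."""
    distances = [abs(e - t) for e, t in zip(theta_hat, theta_true)]
    return sum(distances) / len(distances)

# Hypothetical values for four simulees.
example_power = power_percentage([True, False, False, True])
example_mab = mean_absolute_bias([0.5, -1.0, 0.2, 1.4], [0.0, -0.5, 0.2, 1.0])
```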

An empirical item bank of 1,000 items calibrated using the 2PL was used, where the
item parameters were estimated from real examination data in the Netherlands. For comparing
the distributions using a fixed and a stochastic design for the Rasch model, an infinite item
bank with items fitting the RM was used, that is, it was assumed that the optimal item was
always present. The abilities were standard normally distributed: $\theta \sim N(0,1)$. Let $n$ be the
number of response vectors simulated (simulees), and let $m$ be the number of replications per
simulee generated (replications).



Results 



Distribution of the Statistics 

The first study was designed to assess the adequacy of using fixed design generation of statistics
in the case of stochastic design data. In Table 1, for the Rasch model with $n$ = 300 and $m$ = 1,000,
and using an infinite item pool, the distributions of significance probabilities of the statistics
$T$, $L$, and $X$ under the fixed and the stochastic generating approach are shown. Table 1 also
shows the values of Pearson's chi-squared statistic $X^2$. The expected percentages in all the cells are
10%, and the expected value of $X^2$ is 9 (the degrees of freedom). It can be seen that generation of
the distribution of the statistics using a fixed design does not give a good description of the
distribution under the null hypothesis. This results in high $X^2$ values: for all three statistics,
$X^2 \gg 9$. In the case of the stochastic design, the distributions of $X$ and $L$ are satisfactory
according to the $X^2$ values of 9.133 and 3.467, respectively. However, the distribution of $T$ is
not completely satisfactory. The probability of 0.04 in the left tail is small compared to the
expected probability of 0.10; the result of this is a high $X^2$ value of 24.8. Note that the
probabilities of the statistics in the tails of the distribution are most important, because the tails
contain the fit values of aberrant response vectors.
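
The chi-squared check described above can be sketched as follows. The Python code is illustrative and assumes ten equal-width cells on the unit interval, so that under the null hypothesis each cell is expected to hold 10% of the significance probabilities and the statistic has 9 degrees of freedom; the function name and the example data are hypothetical.

```python
def uniformity_chi_square(p_values, n_cells=10):
    """Pearson X^2 for equal expected counts in n_cells cells on [0, 1]."""
    counts = [0] * n_cells
    for p in p_values:
        # A value of exactly 1.0 is placed in the last cell.
        counts[min(n_cells - 1, int(p * n_cells))] += 1
    expected = len(p_values) / n_cells
    x2 = sum((c - expected) ** 2 / expected for c in counts)
    return x2, counts

# Perfectly uniform significance probabilities give X^2 = 0.
uniform_ps = [(i + 0.5) / 300 for i in range(300)]
x2_uniform, counts_uniform = uniformity_chi_square(uniform_ps)

# All significance probabilities in one cell give a very large X^2.
x2_extreme, _ = uniformity_chi_square([0.05] * 100)
```

Values of $X^2$ far above 9, as reported for the fixed design, indicate that the simulated null distribution does not describe the statistic well.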

Table 2 shows the distribution of the statistics for the 2PL with $n$ = 500 and $m$ = 1,000.
Again the fixed design does not give satisfactory results; the $X^2$ values were highly significant
for all three distributions (not tabulated). This can also be seen in the tails: the left-tail
probabilities of $X$, $L$, and $T$ are 0.036, 0.050, and 0.032, respectively, and the right-tail probabilities
of $X$ and $L$ are 0.033 and 0.007, respectively. Thus, these probabilities are substantially smaller than
0.10. Furthermore, note that the right-tail probability of $T$ is 0.298, which is too high. Table 2
also shows the distribution of the statistics under the stochastic design. In the case of the
stochastic design the left-tail probabilities of $X$, $L$, and $T$ were also too small: 0.045,
0.070, and 0.045, respectively. The probability in the right tail of the distribution of $T$ was again too high:
0.185.

The estimated ability $\hat{\theta}$ is a point estimate, and the precision of the estimation is
different for every $\hat{\theta}$. When it is assumed that ability is normally distributed with mean $\hat{\theta}$ and
variance $I(\hat{\theta})$, $\theta \sim N(\hat{\theta}, I(\hat{\theta}))$, the uncertainty of $\hat{\theta}$ is taken into account. When for the
generation of the distribution $m$ values of $\theta$ were randomly drawn from $N(\hat{\theta}, I(\hat{\theta}))$, the
distributions of the statistics did not improve. These results are also shown in Table 2: again, the
$X^2$ values were highly significant (not tabulated). The probabilities in the left tail for both $X$ and
$T$ were too small, 0.030 and 0.035, respectively, and the probability in the right tail for $T$ (0.175)
was too high.

Power Studies

The second study was designed to investigate the power of the statistics. Table 3 shows the
power of the statistics under the violations of the 2PL and the MAB for a 20-item test. Generally,
there was hardly any power for all three statistics. This can be seen in the first column of Table
3; for example, for $\rho = 0.8$ the test against multidimensional abilities, $T$, classified only 8% of the
truly nonfitting response vectors as aberrant; thus, the power of $T$ was 0.08. However, for $\rho = 0.6$ the
statistic $T$ classified 24% of the truly nonfitting response vectors as aberrant. The test against
local dependence, $L$, detected only 4% of the truly nonfitting simulees for $\delta = 0.4$. However, the
estimated ability $\hat{\theta}$ was rather robust against these model violations. The values of the MAB
under the model violations were not much higher than under the null model. The MAB
increased by 0.11 (from 0.34 to 0.45) for the multidimensional abilities with $\rho = 0.6$.

Table 4 shows the power of the statistics under the model violations, and the MAB, for a
40-item test. Again, the test statistics had hardly any power. However, increasing the test length
from 20 to 40 items resulted, for a two-dimensional ability parameter, in detection rates of 23%
and 25% for $\rho = 0.8$ and $\rho = 0.6$, respectively. The test against local dependence detected
27% of the truly nonfitting simulees for $\delta = 1$. Again $\hat{\theta}$ was rather robust against the violations of
the model; in the worst case the MAB increased by 0.17 (from 0.25 to 0.42).



Discussion 



The results with respect to the power of a person-fit test in a CAT suggest that tests against
violations of local independence in the 2PL have little power, whereas the power for testing against
noninvariant abilities was relatively larger but still low. The low power in CAT compared to
conventional testing is not surprising, because the overall variance of the item difficulties in CAT
is smaller than in conventional testing, and it has been shown (Reise and Due, 1991) that the
smaller the variance of the item difficulties, the lower the power to detect nonfitting respondents.

From a practical point of view the relatively low power of a person-fit test may not be much
of a problem if the $\hat{\theta}$ value is robust against some specific violations of the model. In that case, it
is not necessary to classify a simulee as nonfitting, because $\hat{\theta}$ provides a reasonably good
description of the testing behavior. In the context of a conventional test administration, there is
some evidence that low power goes together with small bias of $\hat{\theta}$. Using simulated data, Meijer
and Nering (1996) showed that nonfitting responses may lead to biased estimation of $\theta$, but that
the bias of $\hat{\theta}$ depends on the $\theta$ level and the type and severity of the nonfitting response
behavior. For example, they found that, for a 50-item test with 70% of the items fitting the 2PL
and random response behavior on 30% of the items, the bias was approximately 0 for $\theta = 0$,
whereas for $\theta = 2$ and $\theta = -2$ the absolute bias was approximately .9. The corresponding detection
rates were .25 for $\theta = 0$ and .55 for $\theta = 2$ and $\theta = -2$. Thus, in that study the low detection rates at
$\theta = 0$ did not seem to be much of a problem because the bias was approximately 0. If similar
results apply for CAT applications, it may be interesting to investigate the trade-off between test
power and robust estimation of $\theta$.

Nering (1995) generated nonfitting responses by changing a 1 score into a 0 score and 
vice versa. Results showed that only for nonfitting response behavior in the beginning of the test, 
the detection rates were acceptable. Using a .05 two-tailed error rate, when the location of the 
response manipulation occurred within the first 5 items the detection rate was between .30 and 
.70. 

One limitation of the present study was that true item parameters were used. In a real 
testing situation item parameters must be estimated, and additional research is needed to 
determine what influence this estimation process will have on $\hat{\theta}$ values estimated from different
procedures when nonfitting responses are present. 

Table 3. Power and MAD($\hat{\theta}$) under various model violations, for the 2PL, $n$ = 100, $m$ = 1,000, $k$ = 20.

Table 4. Power and MAD($\hat{\theta}$) under various model violations, for the 2PL, $n$ = 100, $m$ = 1,000, $k$ = 40.